{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How to recursively split text by characters\n",
    "\n",
    ":::info Prerequisites\n",
    "\n",
    "This guide assumes familiarity with the following concepts:\n",
    "\n",
    "- [Text splitters](/docs/concepts#text-splitters)\n",
    "\n",
    ":::\n",
    "\n",
    "This text splitter is the recommended one for generic text. It is parameterized by a list of separators, which it tries in order until the chunks are small enough. The default list is `[\"\\n\\n\", \"\\n\", \" \", \"\"]`. This keeps paragraphs (and then sentences, and then words) together for as long as possible, since those tend to be the most strongly semantically related pieces of text.\n",
    "\n",
    "1. How the text is split: by a list of characters.\n",
    "2. How the chunk size is measured: by number of characters.\n",
    "\n",
    "Below we show example usage.\n",
    "\n",
    "To obtain the string content directly, use `.splitText`.\n",
    "\n",
    "To create LangChain [Document](https://v02.api.js.langchain.com/classes/langchain_core_documents.Document.html) objects (e.g., for use in downstream tasks), use `.createDocuments`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\n",
      "  Document {\n",
      "    pageContent: \"Hi.\",\n",
      "    metadata: { loc: { lines: { from: 1, to: 1 } } }\n",
      "  },\n",
      "  Document {\n",
      "    pageContent: \"I'm\",\n",
      "    metadata: { loc: { lines: { from: 3, to: 3 } } }\n",
      "  },\n",
      "  Document {\n",
      "    pageContent: \"Harrison.\",\n",
      "    metadata: { loc: { lines: { from: 3, to: 3 } } }\n",
      "  }\n",
      "]\n"
     ]
    }
   ],
   "source": [
    "import { RecursiveCharacterTextSplitter } from \"@langchain/textsplitters\";\n",
    "\n",
    "const text = `Hi.\\n\\nI'm Harrison.\\n\\nHow? Are? You?\\nOkay then f f f f.\n",
    "This is a weird text to write, but gotta test the splittingggg some how.\\n\\n\n",
    "Bye!\\n\\n-H.`;\n",
    "const splitter = new RecursiveCharacterTextSplitter({\n",
    "  chunkSize: 10,\n",
    "  chunkOverlap: 1,\n",
    "});\n",
    "\n",
    "const output = await splitter.createDocuments([text]);\n",
    "\n",
    "console.log(output.slice(0, 3));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You'll note that in the above example we are splitting a raw text string and getting back a list of documents. We can also split documents directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\n",
      "  Document {\n",
      "    pageContent: \"Hi.\",\n",
      "    metadata: { loc: { lines: { from: 1, to: 1 } } }\n",
      "  },\n",
      "  Document {\n",
      "    pageContent: \"I'm\",\n",
      "    metadata: { loc: { lines: { from: 3, to: 3 } } }\n",
      "  },\n",
      "  Document {\n",
      "    pageContent: \"Harrison.\",\n",
      "    metadata: { loc: { lines: { from: 3, to: 3 } } }\n",
      "  }\n",
      "]\n"
     ]
    }
   ],
   "source": [
    "import { Document } from \"@langchain/core/documents\";\n",
    "import { RecursiveCharacterTextSplitter } from \"@langchain/textsplitters\";\n",
    "\n",
    "const text = `Hi.\\n\\nI'm Harrison.\\n\\nHow? Are? You?\\nOkay then f f f f.\n",
    "This is a weird text to write, but gotta test the splittingggg some how.\\n\\n\n",
    "Bye!\\n\\n-H.`;\n",
    "const splitter = new RecursiveCharacterTextSplitter({\n",
    "  chunkSize: 10,\n",
    "  chunkOverlap: 1,\n",
    "});\n",
    "\n",
    "const docOutput = await splitter.splitDocuments([\n",
    "  new Document({ pageContent: text }),\n",
    "]);\n",
    "\n",
    "console.log(docOutput.slice(0, 3));"
   ]
  },
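  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the separator fallback described at the start of this guide more concrete, here is a minimal sketch in plain TypeScript. It is an illustration under stated assumptions, not the library's implementation: `recursiveSplit` is a hypothetical helper, it ignores `chunkOverlap`, it drops separators at chunk boundaries, and it measures length with `.length`.\n",
    "\n",
    "```typescript\n",
    "// Simplified sketch of recursive character splitting (hypothetical helper).\n",
    "// Assumptions: no chunk overlap, separators are dropped at chunk boundaries,\n",
    "// and chunk size is measured with String.prototype.length.\n",
    "function recursiveSplit(\n",
    "  text: string,\n",
    "  separators: string[],\n",
    "  chunkSize: number\n",
    "): string[] {\n",
    "  if (chunkSize < 1) throw new Error('chunkSize must be at least 1');\n",
    "  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];\n",
    "\n",
    "  // No separators left: fall back to fixed-width slices.\n",
    "  if (separators.length === 0) {\n",
    "    const slices: string[] = [];\n",
    "    for (let i = 0; i < text.length; i += chunkSize) {\n",
    "      slices.push(text.slice(i, i + chunkSize));\n",
    "    }\n",
    "    return slices;\n",
    "  }\n",
    "\n",
    "  const [separator, ...rest] = separators;\n",
    "  // An empty separator means: split into individual characters.\n",
    "  const pieces = separator === '' ? Array.from(text) : text.split(separator);\n",
    "\n",
    "  const chunks: string[] = [];\n",
    "  let current = '';\n",
    "  for (const piece of pieces) {\n",
    "    if (piece.length === 0) continue;\n",
    "    // Greedily merge neighbouring pieces while they still fit.\n",
    "    const candidate = current.length > 0 ? current + separator + piece : piece;\n",
    "    if (candidate.length <= chunkSize) {\n",
    "      current = candidate;\n",
    "      continue;\n",
    "    }\n",
    "    if (current.length > 0) {\n",
    "      chunks.push(current);\n",
    "      current = '';\n",
    "    }\n",
    "    if (piece.length <= chunkSize) {\n",
    "      current = piece;\n",
    "    } else {\n",
    "      // Still too long: retry this piece with the next separator in the list.\n",
    "      chunks.push(...recursiveSplit(piece, rest, chunkSize));\n",
    "    }\n",
    "  }\n",
    "  if (current.length > 0) chunks.push(current);\n",
    "  return chunks;\n",
    "}\n",
    "\n",
    "const chunks = recursiveSplit('Hi.\\n\\nI am Harrison.', ['\\n\\n', '\\n', ' ', ''], 10);\n",
    "console.log(chunks); // three chunks: 'Hi.', 'I am', 'Harrison.'\n",
    "```\n",
    "\n",
    "With a 10-character budget, the paragraph break isolates `Hi.`, and the remaining sentence only fits once it is split on spaces. The real splitter additionally handles overlap between neighbouring chunks."
   ]
  },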
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can customize the `RecursiveCharacterTextSplitter` with arbitrary separators by passing a `separators` parameter like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\n",
      "  Document {\n",
      "    pageContent: \"Some other considerations include:\",\n",
      "    metadata: { loc: { lines: { from: 1, to: 1 } } }\n",
      "  },\n",
      "  Document {\n",
      "    pageContent: \"- Do you deploy your backend and frontend together\",\n",
      "    metadata: { loc: { lines: { from: 3, to: 3 } } }\n",
      "  },\n",
      "  Document {\n",
      "    pageContent: \"r, or separately?\",\n",
      "    metadata: { loc: { lines: { from: 3, to: 3 } } }\n",
      "  }\n",
      "]\n"
     ]
    }
   ],
   "source": [
    "import { RecursiveCharacterTextSplitter } from \"@langchain/textsplitters\";\n",
    "import { Document } from \"@langchain/core/documents\";\n",
    "\n",
    "const text = `Some other considerations include:\n",
    "\n",
    "- Do you deploy your backend and frontend together, or separately?\n",
    "- Do you deploy your backend co-located with your database, or separately?\n",
    "\n",
    "**Production Support:** As you move your LangChains into production, we'd love to offer more hands-on support.\n",
    "Fill out [this form](https://airtable.com/appwQzlErAS2qiP0L/shrGtGaVBVAz7NcV2) to share more about what you're building, and our team will get in touch.\n",
    "\n",
    "## Deployment Options\n",
    "\n",
    "See below for a list of deployment options for your LangChain app. If you don't see your preferred option, please get in touch and we can add it to this list.`;\n",
    "\n",
    "const splitter = new RecursiveCharacterTextSplitter({\n",
    "  chunkSize: 50,\n",
    "  chunkOverlap: 1,\n",
    "  separators: [\"|\", \"##\", \">\", \"-\"],\n",
    "});\n",
    "\n",
    "const docOutput = await splitter.splitDocuments([\n",
    "  new Document({ pageContent: text }),\n",
    "]);\n",
    "\n",
    "console.log(docOutput.slice(0, 3));"
   ]
  },
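  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each `Document` in the outputs above carries a `loc.lines` range in its metadata. The sketch below shows one way such a range could be derived, assuming it holds the 1-based line numbers that the chunk spans in the source text. `locateChunks` and `countNewlines` are hypothetical helpers written for illustration, not part of the library.\n",
    "\n",
    "```typescript\n",
    "// Hypothetical shape mirroring the metadata printed in the outputs above.\n",
    "type LocatedChunk = {\n",
    "  pageContent: string;\n",
    "  metadata: { loc: { lines: { from: number; to: number } } };\n",
    "};\n",
    "\n",
    "// Count newline characters in a string.\n",
    "function countNewlines(s: string): number {\n",
    "  return s.split('\\n').length - 1;\n",
    "}\n",
    "\n",
    "// Assumption: chunks appear in the text in order and may overlap.\n",
    "function locateChunks(text: string, chunks: string[]): LocatedChunk[] {\n",
    "  const located: LocatedChunk[] = [];\n",
    "  let searchFrom = 0;\n",
    "  for (const chunk of chunks) {\n",
    "    const start = text.indexOf(chunk, searchFrom);\n",
    "    if (start === -1) throw new Error('chunk not found in text');\n",
    "    // Lines are 1-based: one more than the newlines preceding the chunk.\n",
    "    const from = countNewlines(text.slice(0, start)) + 1;\n",
    "    const to = from + countNewlines(chunk);\n",
    "    located.push({\n",
    "      pageContent: chunk,\n",
    "      metadata: { loc: { lines: { from, to } } },\n",
    "    });\n",
    "    // Advance by one character only, so overlapping chunks are still found.\n",
    "    searchFrom = start + 1;\n",
    "  }\n",
    "  return located;\n",
    "}\n",
    "\n",
    "const docs = locateChunks('Hi.\\n\\nI am Harrison.', ['Hi.', 'I am', 'Harrison.']);\n",
    "console.log(docs.map((d) => d.metadata.loc.lines));\n",
    "```\n",
    "\n",
    "Here `Hi.` sits on line 1, while `I am` and `Harrison.` both sit on line 3, because the blank line between the paragraphs counts as line 2."
   ]
  },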
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "\n",
    "You've now learned a method for splitting text by character.\n",
    "\n",
    "Next, check out [specific techniques for splitting on code](/docs/how_to/code_splitter) or the [full tutorial on retrieval-augmented generation](/docs/tutorials/rag)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Deno",
   "language": "typescript",
   "name": "deno"
  },
  "language_info": {
   "file_extension": ".ts",
   "mimetype": "text/x.typescript",
   "name": "typescript",
   "nb_converter": "script",
   "pygments_lexer": "typescript",
   "version": "5.3.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
